This project implements a method for simplifying and reducing time series data by identifying “pivot points”: a subset of points selected from the data that defines a set of line segments from which the actual data deviates by no more than a given tolerance.
Two different algorithms are presented for calculating these points, depending on whether the full time series is available or if the points are being calculated in real time.
It is assumed that representing the actual data via these “pivot points” is more advantageous for certain applications than using the full data (particularly for stock market price prediction, trend identification and pattern recognition).
The following files are contained in the repository:
If you want to replicate this analysis, you can clone this repository and either 1) run the declaration_of_functions.R script to make the find.pivot() and rstock() functions available, or 2) run the animate scripts to create animations based on randomly generated stock data.
The remainder of this file contains a detailed discussion of the implementation and characteristics. Animated gifs illustrate the algorithms and results.
The methods developed here were motivated by difficulties encountered in previous attempts to write scripts with explicit rules for making buy/sell decisions in the stock market.
Due to the highly dynamic nature of the price signals involved, the identification of patterns in the data is a complex task, which renders the definition of rules based on prominent features very difficult.
The usual solution to the problem of dealing with highly noisy price data is the use of longer price candles. However, this method has a shortcoming: it is not sensitive to varying levels of market volatility, since each candle spans the same amount of time.
In order to deal with these problems, the goal of this project is to define algorithms that reduce the data so that:
As a hypothesis (to be confirmed in a subsequent analysis), three different advantages are expected from the use of this reduction technique:
Two different algorithms were developed to identify the pivot points for a given time series, depending on whether or not the full data is available at the time of calculation:
We now detail the two algorithms and illustrate their results.
As stated above, this approach requires knowledge of the entire data set at the time of calculation.
These are its steps:
1. Define the interval under analysis as the full interval between the first and last available points
2. Draw a reference line segment between these two points
3. Evaluate the deviation between every value in the interval and the reference line
4. If any point deviates by more than the specified tolerance, define the point of highest deviation as a new pivot point
5. For each new pivot point added, run steps 2-4 recursively on the two sub-intervals created by cutting the interval from step 1 at the new pivot point
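As a concrete illustration, the steps above can be sketched in Python (the repository's actual implementation is the R function find.pivot(); the function name `find_pivots` and the choice of measuring deviation relative to the actual price are assumptions made here for the sketch, not necessarily the repository's choices):

```python
def find_pivots(prices, tol=0.10):
    """Recursively select pivot indices so that the piecewise-linear
    approximation through them stays within `tol` of the data."""
    pivots = {0, len(prices) - 1}  # the first and last points are always kept

    def refine(lo, hi):
        if hi - lo < 2:
            return  # no interior points to check
        worst_i, worst_dev = None, tol
        for i in range(lo + 1, hi):
            # Reference line between the two interval endpoints (step 2).
            frac = (i - lo) / (hi - lo)
            ref = prices[lo] + frac * (prices[hi] - prices[lo])
            # Deviation relative to the actual price (assumed metric).
            dev = abs(prices[i] - ref) / prices[i]
            if dev > worst_dev:
                worst_i, worst_dev = i, dev
        if worst_i is not None:
            pivots.add(worst_i)   # step 4: point of highest deviation
            refine(lo, worst_i)   # step 5: recurse on both sub-intervals
            refine(worst_i, hi)

    refine(0, len(prices) - 1)
    return sorted(pivots)
```

For perfectly linear data the reference line never exceeds the tolerance, so only the two endpoints are returned; a single spike forces extra pivots around it.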
The animation below illustrates the calculation for several randomly generated stocks. They each cover a period of 960 days (roughly 4 years of daily prices), and each frame represents one iteration of the recursive process. The red points and line segments represent the pivot points and respective price approximation, and the green band represents the tolerance. The tolerance for all the examples is 10%.
From the images, it is evident that, given the 10% tolerance, a set of around 20 points is normally sufficient to represent the full set of 960 points in the original data.
In order to deal with the limitation of the recursive algorithm, namely that the entire data set must be known at the time of calculation, the sequential approach was developed.
In this method, the set of past pivot points is not affected by the addition of new data. All the algorithm does is evaluate whether the new point requires the creation of a new pivot point in order to keep all the original data within the specified tolerance.
These are the algorithm steps:
1. Define the interval under analysis: it begins at the last pivot point (or, if there is none yet, at the first point of the time series) and ends at the current value
2. Draw a reference line segment between these two points
3. Evaluate the deviation between every value in the interval and the reference line
4. If any point deviates by more than the tolerance, define the point of highest deviation as a new pivot point
5. For each new value, repeat steps 1-4
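A minimal sketch of one sequential step, again in Python rather than the repository's R code (the name `update_pivots` is hypothetical, and the relative-deviation metric is an assumption):

```python
def update_pivots(prices, pivots, tol=0.10):
    """One sequential step: after a new price arrives at the end of
    `prices`, decide whether a new pivot is needed and append it."""
    lo = pivots[-1] if pivots else 0   # last pivot, or the first point (step 1)
    hi = len(prices) - 1               # the newest value
    worst_i, worst_dev = None, tol
    for i in range(lo + 1, hi):
        # Reference line between the interval endpoints (step 2).
        frac = (i - lo) / (hi - lo)
        ref = prices[lo] + frac * (prices[hi] - prices[lo])
        # Deviation relative to the actual price (assumed metric).
        dev = abs(prices[i] - ref) / prices[i]
        if dev > worst_dev:
            worst_i, worst_dev = i, dev
    if worst_i is not None:
        pivots.append(worst_i)         # step 4: point of highest deviation
    return pivots

# Streaming usage: feed one value at a time, keeping the running pivot list.
prices, pivots = [], []
for p in [100.0, 100.0, 100.0, 130.0, 100.0, 100.0, 100.0]:
    prices.append(p)
    pivots = update_pivots(prices, pivots)
```

Because past pivots are never revised, each call only has to inspect the data since the last pivot, which keeps the per-step cost bounded as the series grows.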
The algorithm can be seen in action in the animation below. Here, the blue error bar on the last segment in each frame represents the current pivot point candidate, which is the point that deviates furthest from the reference line. This point is added to the list of pivot points when its deviation exceeds the given tolerance.
The obvious advantage of this approach is that the algorithm can be applied indefinitely, as new pivot points are simply appended to the running list. The following animation illustrates this feature: